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Abstract 

This paper deals -with the problem of finding the globally optimal subset of h elements from a larger 
set of n elements in d space dimensions so as to minimize a quadratic criterion, ■with an special emphasis 
on applications to computing the Least Trimmed Squares Estimator (LTSE) for robust regression. The 
computation of the LTSE is a challenging subset selection problem involving a nonlinear program with 
continuous and binary variables, linked in a highly nonlinear fashion. The selection of a globally optimal 
subset using the branch and bound (BB) algorithm is limited to problems in very low dimension, tipically 
d < 5, as the complexity of the problem increases exponentially with d. We introduce a bold pruning 
strategy in the BB algorithm that results in a significant reduction in computing time, at the price of a 
negligeable accuracy lost. The novelty of our algorithm is that the bounds at nodes of the BB tree come from 
pseudo-convexifications derived using a linearization technique with approximate bounds for the nonlinear 
terms. The approximate bounds are computed solving an auxiliary semidefinite optimization problem. We 
show through a computational study that our algorithm performs well in a wide set of the most difficult 
instances of the LTSE problem. 

Keywords: Global Optimization; Integer Programming; High Breakdown Point Regression; Branch and 
bound; Relaxation-Linearization Technique. 


1. Introduction 


In this article we deal with a nonlinear subset selection problem arising in the computation of linear 
regression estimators with strong robustness properties. 

Before entering into the details of the problem we introduce the problem of robust estimation through 
an example from Rousseeuw and Leroy (19871. In Figure]^ (left) we show, for each year from 1950 to 1973, 
the number of outgoing international phone calls from Belgium. The bulk of the data follows a linear model; 
nonetheless, there are 6 observations that deviate from the majority. In fact, during the period between 
1964 and 1969, there was a change on the record system, which actually recorded the total duration, in 
minutes, of the international phone calls instead of the number of calls. We plot the regression line obtained 
by a robust method (solid line) and that obtained by the method of Least Squares (dashed line). The Least 
Squares estimation is strongly affected by the outliers. 

A more appealing example ( [Jalali-Heravi and Konouz 20021 where a subpopulation acts differently, is 
shown in Figure (right). There are plotted, for 32 chemical compounds, a quantity called Krafft point 
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Figure 1: Some real data containing outliers. At the left: phone calls from Belgium; at the right: Krafft point of chemical 
compounds. 


versus a molecular descriptor called heat of formation. There is a main group that follows a regression line, 
correctly estimated by a robust estimator (solid line); there is also, besides some few outliers at right, a 
second, smaller group forming what seems to be another regression line. The observations in the second 
group correspond to sulfonates. The Least Squares estimator (dashed line) is not helpful to detect the 
presence of the second group. 

As the reader can figure out, in higher dimensions, where visual inspection is not longer an alternative 
for detecting outliers, specifically suited methods are needed to deal with outliers. This is what robust 
estimators are about. Besides robustness, in a sense to be specified soon, robust regression estimators 
satisfy statistical properties such as asymptotic normality, square-root rate of convergence and equivariance 
properties (Maronna et al. 20061. Unfortunately, the use of robust estimators is not as widespread as one 
may expect because their computation is very time-consuming. Unlike other problems arising in Statistics, 
the difficult problems involved in computing robust estimators remain almost unknown to O.R. specialists. 


1.1. Description of the motivating problem 

We have at our disposal a sample consisting of n observations of the explicative variables {xi, ...,Xn} C 
To each observation of explicative variables it corresponds a response or dependent variable, gathered 
together in a vector y = (j/i, ...,?/„) € K". We assume that the random variables x and y are related through 
a linear model, which implies the existence of a vector /3 € such that 

yi = xJ/3 + di, (1) 


with Si i.i.d. , E[Si] = 0, Var[Si] = a^. The objective of linear regression is to estimate the parameter (3. 
For convenience, we put the explicative variables as rows of a matrix X G 


A = 


jxi\ 

X2 


... 

_ 





X 


XI 

(1) ^(2) 


\Jyn -^n 


M) 


Ad) 


For /3 S we denote by r(/3) the vector of residuals r = y — A/3, with components ri = yi — xj(3- The 
Least Squares (LS) estimator is obtained by minimizing the sum of the squared residuals: 


min rAB)"^. 
2=1 
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Hereafter we adopt the robustness notion introduced by Donoho and Huber (19831, which is based on 


the concept of Breakdown Point (BDP). The BDP of an estimator on a sample is defined as the minimum 
fraction of observations that need to be replaced by arbitrary ones for the estimator to take on arbitrary 
valnes. The BDP of the common LS estimator is 1/n, since it suffices to control just one observation to make 
the estimator divergent. Therefore, the asymptotic BDP of the LS estimator as n grows to infinity is 0%. 
The same is trne if the LS estimator is replaced by any estimator obtained by minimizing a convex function 


of the residnals. Since the pioneer Least Median of Squares (LMS) estimator (see Rousseeuw and Leroy 


1987 for a precise description), there has been a continuons improvement leading to high BDP estimators 
with optimal statistical properties, snch as asymptotic normality, speed of convergence and efficiency. In 
this article we focus on the Least Trimmed Squares estimator, which has the best statistical properties and 
is defined throngh a well structnred optimization problem. Let h be an integer number comprised between 
n/2 and n, and denote by |r|i:„ < \r\ 2 -.n < • • • < kUrn the residuals, ordered by increasing absolute value. 
The Least Trimmed Squares (LTS) estimator is defined as a solution of the problem: 


min Vr(/3)L 

/3gR‘' 

1 — 1 


(LTS) 


In words, the LTSE is the vector of regression coefficients jd that minimizes the snm of the h smallest 
squared distances from the hyperplane defined by fd to the observations yi. The LTSE attains the maximum 
asymptotic BDP of 50% by taking h = [n/2j + [(d + 1)/2J (Ronsseeuw and Leroy 1987|. 


1.2. State of the art 

The first approaches to the optimal snbset selection problem appeared in the field of pattern recongni- 


tion (Narendra and Fukunaga 1977 Yu and Yuan 1993 Chen 2003 Somol et al. 20041. Unlike robust 


regression, the featnre selection problem addressed there is a maximization problem, whose difficulties are 
somehow different from ours. To the best of our knowledge, the LTSE is the only reported application of 
snbset selection involving minimization. 

The exact compntation of high-BDP estimators for d greater than, say, 5 is a difficnlt global optimization 


problem (Bernholt 2005 Erickson et al. 2006 Monnt et al. 20141. Indeed, Erickson et al. (20061 provide 


resnlts snggesting that any exact algorithm for the related LMS requires, for large n, a time superior to 


k ■ for some constant k > 0. Monnt et al. (20141 extend this resnlt to the LTSE under consideration here 


proving that, up to a constant, the time required for computing the LTSE for a given coverage level h must 
be, for large n, bounded between {n/h)‘^ and 

For this reason, the overwhelming majority of the literatnre on computation of robnst estimators is 
focused on stochastic approximation algorithms. Most of these algorithms are constructed upon the basic 


resampling algorithm proposed originally by Ronsseenw and Leroy (19871 for computing the LMS estimator. 


Ronsseenw and Van Driessen (20061 developed a refined version including local improvements and adapted 


to the LTSE. Recently, Torti et al. (20121 condncted a benchmark of stochastic algorithms for approximating 


high-BDP linear regression estimators. The interested reader can find in that article an up-to-date account 
of stochastic approximation algorithms for robnst (though not high-BDP) linear regression. Stochastic 
approximation algorithms tipically provide good approximations to the actnal estimator for problems with a 
nnmber of observations in the hundreds or even in the few thonsands. However, they have some shortcomings 
as well. Two different runs or different implementations of the algorithm may give different results. Also, as 
it is not guaranteed that the obtained solution is a global minimum, nothing can be said abont the actnal 
breakdown point of snch approximations. Even a deterministic, constant-factor approximation to a high 
breakdown point estimator may have a very low breakdown point. 

That being said, for small or medium-sized datasets one may be disposed to invest more time to have a 
gnaranteed global solntion in return. Unfortnnately, despite the notable literature dedicated to stochastic 
approximations there exist very few deterministic algorithms for computing robnst estimators, exactly or 
approximately. We can mention the proposals by Steele and Steiger (19861 and Stromberg (19931 for 


computing the LMS estimator; both based on ennmeration of elemental subsets. For the particnlar case 
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of LMS regression with two predictors (d = 2), Mount et al. (20071 devised a Branch and Bound algorithm 
with an asymptotic running time of 0(n log 2 n). 

The first great step forward in the computation of the LTSE came with the Branch and Bound algorithm 
(BBA) of Agullo (20011. The BBA is an adaptation for minimization problems of the ‘feature subset 
selection’ branch and bound maximization algorithm by Narendra and Fukunaga (19771 and relies on the 
monotonicity of the problem. At a glance, the BBA enumerates all the subsets of h observations out of 
the n, by starting from the empty set and adding one observation at a time. Since the sum of the squared 
residuals increases when an additional observation is added to the LS fit, if a subset of observations with 
cardinality k < h is found to give a sum of squared residuals larger than that of the incumbent set, then 
by monotonicity all the sets containing that set can be discarded from further examination. The BBA is 
reported to be efficient in datasets with up to about n = 50 observations and d = 5 features. To a large 
extent, the difficulty of the BBA in tackling larger problems stems from the following limitations of the 
monotonicity bound 


— It does not provide useful information for small subsets: since in dimension d it is always possible to 
fit d observations exactly, it is impossible to obtain a non-trivial (positive) lower bound for the sum 
of squared residuals for a fit of observations containing a given subset of d or less observations. In 
the enumeration tree, this amounts to not having a lower bound to prune at the top d levels. 


— It does not look ahead: for instance, if n = 15 and h = 6 it gives “the sum of the squared residuals 
of a regression over six observations comprising observations 2, 5 and 8 is greater than the sum of the 
squared residuals of the regression over observations 2, 5 and 8” without quantifying the increase in 
sum of squared residuals due to the incorporation of three more observations. 


Consequently, as the dimension of the explicative variables increases, the BBA must examine a large 
number of elemental subsets, since pruning is possible only at the bottom levels of the enumeration tree, 
even if a good global upper bound is available beforehand. On the other hand, the BBA uses a quite efficient 
explicit formula for computing the increase in sum of squared residuals when adding one observation, which 
makes his algorithm quite efficient at the lower levels of the enumeration tree. 


Hofmann et al. (20101 extended the BBA for obtaining the LTSE for many coverage values h at once. 


besides improving the numerical linear algebra used to compute lower bounds. Very recently, [Bertsimas and| 
Mazumder (2014[| proposed a linear Mixed Integer Optimization (MIO) formulation of the Least Quantile 


of Squares (and in particular the LMS) regression problem. They report good results at solving to provable 
optimality problems of small (n = 100) and medium (n = 500) size for d = 3. Some impressive results are 
reported for approximate (albeit deterministic) solutions for problems with d < 20 and n in the order of 
10^. The BDP of the regression performed with approximate solutions is not reported. 


1.3. Innovations and contributions 

We model the computation of the LTSE as a nonconvex optimization problem comprising continuous 
and binary variables with nonlinear interdependence. Since the nonlinear coupling of the variables makes 
obtaining bounds on the continuous variables impractical, we devise a technique to obtain approximate 
bounds on the continuous variables; this is done by solving an Semi-Definite Programming (SDP) problem 
only once. Then we use the approximate bounds to obtain, via the Relaxation-Linearization Technique 
(RLT), a Second Order Cone Programming (SOCP) problem whose solution approximates the solution of 
the original nonconvex problem. Finally, the SOCP approximations are carefully used to obtain useful lower 
bounds at the top levels of the subset enumeration tree, where existing algorithms just pass through. 


1.4-. Outline of the paper 

We begin by showing, in Section the alternative formulations of the problem that permit to get rid 
of the order statistics involved in Problem (LTSI. More specifically, we show that the problem can be 
cast as a nonlinear mixed-integer program or a concave program, both particular cases of the best subset 
selection problem. Then, in Section we introduce approximate convexifications for products involving 
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binary variables and nonlinear continuous terms when an upper bound for the continuous variables is not 
available. We analyse the validity of the relaxation obtained using an estimation of the bound on the 
continuous variables and their applicability in a branch and bound algorithm. In Section]^ we show how to 
obtain an estimation of the bound on the continuous variables to be used for the approximate convexification 
using a known SDP reformulation of the concave maximization problem. Section describes the actual 
implementation of a branch and bound algorithm incorporating bounds from approximate convexifications. 
In Section]^ we present the results of a computational study showing the performance gains in a branch and 
bound algorithm due to SOCP bounds in a large set of problems. Finally, Section concludes the paper 
with a discussion on further avenues for research in this subject. 


2. Formulation as a best subset problem 


Problem (LTSI can be written as a mixed integer nonlinear program using the fact that for arbitrary 
r € K", if ri-n < r 2 :n Tn-.n denote its ordered components, then 

h ( Ti 


ri,n = min < Wi7 


2=1 




W 


G{0 .= h\, 


2=1 


and 


= min 


2 = 1 


2=1 


where Ch = {w G {0,1}^, X!r=i = /i} is a representation of all the subsets of {1,n} of size h. 

Therefore our original Problem ( LTS[ ) is equivalent to the following mixed-integer nonlinear programming 
problem: 


min 

i=l 

S.t 

r + Xj3 = y, 

e^w = /i, 

w G {0,1}",/3 G K'',r G 


( 2 ) 


where e is the n x 1 vector of ones. 

By splitting the variables of Problem we see that it is equivalent to 


min{ v{w) 

where v{w) is the value of the weighted LS problem 


G Cfi}, 


(3) 


;(w) = inf < 


Wkrl : 


.fc=i 


,r = y- XP 


(4) 


obtained by minimizing over /3 and r for fixed w G Ch in ©■ Hence, Problem © amounts to selecting the 
subset of h observations with the least sum of squared residuals. The function v defined in ([4| is co ncave, 
therefore Problem (LTS I can be thought as a concave minimization problem. [Giloni and Padberg ( 2002[| 
were the first to show this property, and used it to devise a local minimization procedure. [Nguyen and| 


Welsch (20101 revisited this formulation and derived an SDP formulation of the corresponding maximiza¬ 


tion problem. Unfortunately, the degeneracy of the feasible domain makes it difficult to apply concave 
minimization algorithms to problem (LTS I. In this paper we tackle the problem in the form given in ©■ 
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Figure 2: The BB tree for n = 6 and h = 3. 


2.1. The enumeration tree 

In Figure]^ we depict the enumeration tree constructed by the branch and bound algorithm, in a small 
example with h = 3 and n = 6. 

The circled nodes are the leaves; each leaf represents a subset of 3 observations (an element of Ch), which 
is obtained adding recursively the parent of each node until the root 0 is reached. For example, at the end of 
the second branch from right to left there are two leaves, the leaf at the right is associated to the subset of 
observations 6,4 and 3, and that at left to observations 5,4 and 3. In terms of the optimization variable w, 
they are associated to the points (0, 0,1,1,0,1)^ and (0,0,1,1,1,0)^ respectively. Any node has associated 
two index sets So and 5'i representing the variables fixed to 0 and 1 respectively, each of cardinality Jq and 
Ji. Using these two quantities we can compute the number of child nodes asn — Jq — h + 1 and the number 
of leaves that can be reached from the l-th child node as n — Ji — Jq — + 1. 


2.2. The monotonocity lower bound 

The value of the function v at a point w G [0,1]" gives the least sum of weighted squares of residuals 
with weights w. In particular il w G {0,1}", Wi = \ for i G J C {1, ...,n} and Wi = 0 for i ^ J, then v{w) 
is the sum of squares of the fit to the subset of observati ons J. The v alue of v{w) is finite as long as the 
matrix M{w) = D{w)X is invertible, and in this case (Agullo 20011, 


Agullo 2001 


v{w + Cj) = v{w) + 


1 + xjM{w)~^Xj ’ 


(5) 


where D{w) is the diagonal matrix formed from the vector w, rj{w) is the j-th residual obtained from 
the weighted least squares problem 0 with weights w, and ej is the j-th euclidean basis vector. Formula 
0 gives the change in the sum of squared residuals by adding observation j to the fit J, provided that J 
contains at least d linearly independent XiS. Note that v{w + ej) — v{w) > 0, which means that the objective 
function is non-decreasing from one node to any of its children. Therefore, the RSS at one node is a lower 
bound for the objective function of its children. If the tree is examined using a depth-first search strategy, 
it is possible to keep a Cholesky factorization of M{w) from which to perform rank-one updates in order 
to quickly compute the quantities xjM{w)~^Xj. The shorthands of the monotonicity bound were already 
mentioned: Formula 0 requires M(w) = X^D{w)X to be invertible, thus it is not applicable at the d 
top levels of the tree, and it provides loose bounds when Ji is far from h (when there are few more than d 
observations). 
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3. Pseudo-convexifications with approximate bounds 


At each node of the tree we need to (under) estimate the value of the problem: 


with the additional constraints 


min J^Wkrl, 

2=1 

S.t 

r + Xj3 = y, 
e^w = h, 

0 < ru < 1, 

/3 e e 1 

Wk = 1, k G Si 
Wk =0, k G So 


(6) 


(7) 


for two index sets Si, Sq particular to each node. Suppose that we had an upper bound 11^ for each quadratic 
term rl; then we can apply a linearization technique (Glover 1975 Adams et al.l |2004l |Adams and Sherali 


1990 19931 to get rid of the product of the continuous term rf. with the binary variable Wk- This is done by 


constructing a new continuous variable Uk to replace each product Wbrj., and adding a number of additional 
constraints in order to ensure that the value of the variable equals the product Wj-r^ at any feasible point 
of the new problem. Applied to Problem (§ this procedure yields to the following SOCP problem 


E Jl 

k=l'^k 

S.t 

rl -nfc(l - Wk) < Uk, 

Wk = 1 , 

Wk = 0 , 

(e, w) = h 
r + Xl3 = y, 

1, r e K",/? e 


1 < A; < n 
kG Si 
kG So 


(P) 


u G 


,u;e{o,ir. 


From the standard theory of Glover (19751, if 11^ > rl, where r are the residuals at a solution to 


Problem (LTSi, then Problem (0r coincides with ( |LTS[ ) for any binary realization of w. Unfortunately, 
as we show in the following section, the nonlinear coupling of the variables restrains us from efficiently 
obtaining a guaranteed upper bound for the residuals of Problem (LTS). For this reason we introduce 
pseudo-convexifications, which are instances of problem (|^ for an approximate upper bound (nfe)i<fe<„. 
We say that a solution to 0 is consistent if rf < 11^ for any k = l,...,n at the optimal r. Gonsistency 
of the solution to a pseudo-convexification is a necessary condition for being an actual convexification, but 
in general it is not sufficient. As a consequence, using bounds obtained from a pseudo-convexification in a 
branch and bound algorithm can lead to pruning branches potentially yielding a new solution. Of course, this 
drawback can be avoided by giving very large values to 11. Nevertheless, larger values of 11 yields to weaker 
lower bounds when relaxing the binary constraint. Therefore, we are led to find a compromise between 
a large bound ensuring the equivalence of Problems (LTS I and 0. and a smaller bound promoting tight 
lower bounds for Problem (|^. In the next section we describe a strategy for obtaining bounds achieving 
that comprise. 


4. Obtaining Good Approximate bounds 

It is customary when linearizing polynomial programs to suppose that the optimization takes place 
on a bounded separable polytope, and that upper and lower bounds for each variable exists and can be 
computed by linear programming. This does not hold in our case. In our problem the variable w ranges 
over the unit cube in K", and all the other variables have a nonlinear dependency on w. Indeed, for each 
w there exists an unique feasible as seen from (0; even more, /?„, is the unique solution to the system 
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X^D{w)Xl3 
r + Xfi = y. 
problem 


= X^D{w)y. The residuals are linked to jd, and therefore to w, by the linear constraint 
Therefore, an exact upper bound for can be obtained by solving the following auxiliary 


max 

W 

S.t 

e^w = h 
r + Xf3 = y, 
X^D{w)r = 0 
0 < w < 1. 


i^k) 


Problem is a maximization problem on w only but, unlike Problem (|^, it is not a concave maxi¬ 
mization problem. Even it it could be done efficiently, solving n problems would be cumbersome. For that 
reason we look for one single bound for all the residuals that can be efficiently computed. A closely related 
problem is that of maximizing the weighted sum of squared residuals, which amounts to maximizing the 
function v defined in 

v{w) 


max 

s.t 


e^w = q, 

0 < u; < 1, 


( 8 ) 


for some 1 < g < n. This is a concave maximization problem, therefore a global minimum can be efficiently 
computed. In fact, Nguyen and Welsch (20101 showed that Problem ([^ can be cast as an SDP problem, 
therefore it can be solved using standard widely-available software. Ifd-|-l<g<n and we force the 
variables to be binary, the solution to Problem (|^ is the subset of q observation with the largest sum 
of squared residuals. In any case, the value of the problem is monotone non-decreasing for 1 < q < n. 
Moreover, it has the property of going to -l-oo if any subset of the observations is replaced by divergent 
ones. After extensive numerical experiments we found that solving (|^ for q = d/2 yields to an effective 
approximation of the upper bound for the residuals. 


5. A branch and bound algorithm with SOCP bounds (S-BB) 


In our implementation, the Narendra-Fukunaga tree described in Subsection |2.1| is examined using a 
depth-first search strategy. We perform an in-level node ordering to take advantage of the unbalanced 
structure of the tree, as leftmost branches have many more children than those at the right. The innovations 
of our algorithm take place at the d top levels of the tree; at level d + 1 and below the algorithm behaves 


like the BBA of Agullo (2001). 


5.1. Preliminaries 

We used the LS estimator as an initial solution; as a consequence, our algorithm is deterministic. As 
indicated in Section]^ we set 11 equal to the optimal value of Problem ([^ with q = d/2. Problem (|^ is 
solved using the SDPT3 interior-point solver (Tiitiincii et al., 2003). 


5.2. Lower bounds 

Problem 0 with the binary constraint relaxed to 0 < Wfe < 1 is denoted as 0. The quadratic 
constraint — 11(1 — Wk) < Ufc can be cast as a SOCP constraint, for that reason we call Problem 0 the 
SOCP relaxation. Problem 0 was solved using CPLEX 12.5. At each node we first check the size of the 
subtree below each child node. Even nodes at top levels of the tree can have very few leaves below, in which 
case it is not worth spending time solving the SOCP relaxation. We launch the SOCP relaxation only for 
children with more than 10® leaves. If the solution to Problem 0 results to be consistent, and the value 
of the problem is greater than the current upper bound, the branch is pruned. Otherwise, the residuals of 
the solution are still useful for ranking the children and performing the in-level ordering, by putting the 
observations with the largest residuals at the left, to promote subsequent pruning of large branches. 












5.2.1. Adjusting the bounds on r 

A look at problem 0 shows that the optimal values of 11^ can be anticipated for k S S'o U S'!. For 
k G Si the upper bound does not enter into play, we always have Uk = on the contrary, for k G Sq we 

should always have Uk = 0, therefore we set 11^ = 10 • H, where 11 is the upper bound obtained by solving 
Problem ([^ and used for k ^ SqU Si. In practice this forces Uk = 0, and does not spoil the conditioning of 
the SOCP problems as a huge number would do (in theory, we should set those 11^ to +oo). 


5.3. Local improvements 

Another innovation of our BB algorithm is the incorporation of a local search. Each time a leaf is 
examined, we apply the concentration steps of Rousseeuw and Van Driessen (20061 to obtain an eventually 
better incumbent solution, and use it to update the global upper bound of the algorithm. 


6. Computational study 
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Figure 3: A taxonomy of regression outliers. 


Now we illustrate through a computational study the impact of incorporating the SOCP lower bound 
in a branch-and-bound algorithm. In order to perform a systematic study, we generated synthetic datasets 
with sizes in a controlled range. It is known that the structure of the outliers, and not only their magnitudes, 
strongly affects the regression technique as well as the behaviour of the approximating algorithm. In Figure 
1^ we illustrate the taxonomy of linear regression outliers (Rousseeuw and van Zomeren 19901. Outliers are 
called vertical if only the y component (the response) is contaminated. Vertical outliers are the more benign 
ones, and even some convex estimators, such as the li estimator, can cope with them to some extent (Giloni 
and Padberg, 2004[ |. Leverage points are points whose explicative variables are corrupted. In contrast to 
vertical outliers, leverage outliers can be very harmful. Excepting the case of the “good leverage points” 
illustrated in Figure which are in general not considered as outliers, bad leverage points are the most 
adversarial type of contamination. 

For our study we generate synthetic data with outliers in the following way: 

- We generate regular observations following model Q, with 5 standard Normal, for different number 
of cases (n) and explicative variables (d). 


- On each dataset we replace 10 regular observations by bad leverage outliers. 


The bad leverage outliers were obtained by shifting randomly selected observations in two different ways: 
by a large, deterministic shift (high leverage outliers) and by adding a random term drawn form a Laplace 
distribution (heavy tail outliers). 
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For each combination n/d/type-of-contamination we drew 25 datasets as described above and measured 
the total time spent by the S-BB algorithm described in Section and by the BBA (Agullo 20011 in 
computing the LTSE with a breakdown point of 50% {h = [n/2j + [{d + 1)/2J). 

All computations were done in MATLAB version R2008b on a 64-bit Linux machine, with 8 cores and 6 
GB RAM. For the SOCP relaxations we used IBM ILOG GPLEX Studio v. 12.5 via its MATLAB interface. 

The computing times in tens of seconds, averaged over the 25 repetitions, are shown in Tables and 
respectively. The impact of the SOGP bounds is, as expected, more important as the number of explicative 
variables increases, and more pronounced for larger n. The reduction in computing time exceeds the 20% 
for n = 40 and d greater than 15. The accuracy of the solution is largely preserved; in Table we show the 
rate of success, which is larger than 99% in all but one of the cases. The computing times in Tables and 
[^marked with a dagger are averages excluding the run that did not gave the exact solution. 

Further reductions in computing time are possible by relaxing the optimality goal. In this direction, we 
can mention that the fraction of the SOGP relaxations resulting in inconsistent solutions is not negligeable; 
using those solutions to derive bounds could result in a great performance improvement. Another way to 
do the same thing is by decreasing the parameter q used to obtain the approximate bound 11. However, the 
goal of this work was to improve the computation of the LTSE with the least possible lost in accuracy, and 
it was achieved. 


d 


Alg. 

n 

12 

13 

14 

15 

16 

17 

18 


30 

0.51 

0.35 

0.51 

0.26 

0.36 

0.14 

0.18 

BBA 

35 

5.19 

4.79 

7.99 

5.68 

8.72 

5.02 

7.29 


40 

22.15 

23.68 

40.85 

33.59 

56.63 

38.99 

59.27 


30 

0.49 

0.34 

0.5H 

0.26 

0.35 

0.14 

0.18 

S-BB 

35 

4.60 

4.21 

6.80 

5.051' 

7.651 

4.45 

6.411 


40 

22.53 

20.89 

34.90 

26.91 

45.44 

29.87 

46.49 


Table 1: Computing times (in tens of seconds) for high leverage outliers 


7. Conclusions and perspectives 

We have presented an approximate convex relaxation for the LTS problem. Its incorporation in a 
branch-and-bound algorithm yields to significant savings in computing time at the price of a negligeable 
accuracy lost. Our standpoint is that of improving the computation of an estimator with well-studied 
statistical properties. Now then, with the understanding of the underlying problem gained with this study 
one could propose alternative techniques yielding to robust estimators defined with an eye on computability. 
Goncretely, think of the estimator defined as the solution to Problem ©. with n obtained by solving ([^ for 
some 1 < g < n. If the binary constraint is relaxed, such an estimator is defined by two convex optimization 


d 


Alg. 

n 

12 

13 

14 

15 

16 

17 

18 


30 

0.52 

0.35 

0.52 

0.26 

0.35 

0.14 

0.19 

BBA 

35 

5.26 

4.81 

8.07 

5.67 

8.69 

5.03 

7.29 


40 

21.41 

23.90 

41.41 

34.82 

56.57 

39.71 

59.47 


30 

0.50 

0.34 

0.52 

0.261 

0.35 

0.14 

0.19 

S-BB 

35 

4.52 

4.21 

6.86 

4.83 

7.71 

4.481 

6.53 


40 

22.11 

21.43 

34.70 

27.46 

44.54I 

31.44 

46.72 


Table 2: Computing times (in tens of seconds) for heavy tail outliers 
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n 


_%_ 30 35 

high leverage outliers 99.55 98.66 
heavy tail outliers 99.55 99.55 
average 99.55 99.10 


40 average 

loo 99.40 

99.55 99.55 

99.77 99.47 


Table 3: Success rates of the S-BB algorithm 


problems, with a range of applicability in the thousands of observations; if the binary constraint is kept, 
we dispose of SOCP convex relaxations at any node of the BB tree, without wasting time with inconsistent 
relaxations. The study of the statistical and computational aspects of that proposal will be the subject of 
a future work. 
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